The Effect of Stemming on Arabic Text Classification: An Empirical Study
Abstract
The information world is rich in documents in different formats and applications, such as databases, digital libraries, and the Web. Text classification supports the search functionality offered by search engines and information retrieval systems, helping them cope with the large number of documents on the Web. Many text classification studies have been applied to English, Dutch, Chinese, and other languages, whereas fewer have been applied to Arabic. This paper addresses the automatic classification of Arabic text documents, using stemming as one of the preprocessing steps. Results showed that, without stemming, the support vector machine (SVM) classifier achieved the highest classification accuracy under the two test modes, at 87.79% and 88.54%. Stemming, on the other hand, negatively affected accuracy: SVM accuracy under the two test modes dropped to 84.49% and 86.35%.

DOI: 10.4018/ijirr.2011070104. International Journal of Information Retrieval Research, 1(3), 54-70, July-September 2011.

(... Mesleh et al., 2007; Rahman et al., 2003; Zubi, 2009; Al-Harbi et al., 2008). Several studies have applied text classification techniques to English and other European languages. On the other hand, few researchers have addressed Arabic text classification. Text preprocessing and preparation, especially for Arabic, is a crucial task in several applications, including information retrieval, text mining, and natural language processing, where processing includes stages such as stop word removal and stemming.
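The two preprocessing stages named above, stop word removal and stemming, can be illustrated with a minimal sketch. This excerpt does not say which stemmer or stop word list the study used, so the code below is only an assumed stand-in: a simple affix-stripping (light) stemmer with tiny, hypothetical `STOP_WORDS`, `PREFIXES`, and `SUFFIXES` lists chosen for illustration.

```python
from __future__ import annotations

# Hypothetical, deliberately tiny resources for illustration only.
# They are NOT the stop word list or stemmer used in the paper.
STOP_WORDS = {"في", "من", "على", "إلى", "عن"}
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل")  # longer prefixes first
SUFFIXES = ("ات", "ون", "ين", "ها", "ة")


def light_stem(word: str, min_len: int = 3) -> str:
    """Strip at most one prefix and one suffix, keeping at least min_len letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


def preprocess(text: str) -> list[str]:
    """Whitespace-tokenize, drop stop words, then stem the remaining tokens."""
    return [light_stem(tok) for tok in text.split() if tok not in STOP_WORDS]


if __name__ == "__main__":
    # "The teachers [are] in the school"
    print(preprocess("المعلمون في المدرسة"))
```

With these lists, the sample sentence loses its stop word and each remaining token loses one prefix and one suffix. A real pipeline would also need normalization (for example, of diacritics and letter variants), which this sketch omits.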
Stemming reduces a word to its stem (Al-Shammari et al., 2008); the process uses morphological analysis of the word to obtain its stem (Sembok et al., 2011). Stemming is an important technique that is commonly used in information retrieval and data mining, as well as in many other NLP applications. It is important for some natural languages and unimportant for others. As reported by Sembok et al. (2011) and Al-Shammari (2008), stemming has the following benefits:
• Stemming helps in reducing the size of the
Similar resources
Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents
Besides its own merits, text classification (TC) has become a cornerstone in many applications. Work presented here is part of and a pre-requisite for a project we have undertaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...
New stemming for Arabic text classification using feature selection and decision trees
In this paper we conduct a comparative study between two stemming algorithms, the Khoja stemmer and our new stemmer, for Arabic text classification (categorization), using chi-square statistics for feature selection and focusing on the decision tree classifier. Evaluation used a corpus that consists of 5070 documents independently classified into six categories: sport, entertainment, business, middle east...
Document Analysis And Classification Based On Passing Window
In this paper we present a document analysis and classification system to segment and classify the contents of Arabic document images. This system includes preprocessing, document segmentation, feature extraction and document classification. A document image is enhanced in the preprocessing step by removing noise, binarization, and detecting and correcting image skew. In document segmentation, an algorith...
A Study of Text Preprocessing Tools for Arabic Text Categorization
Text preprocessing is an essential stage in text categorization (TC) particularly and text mining generally. Morphological tools can be used in text preprocessing to reduce multiple forms of the word to one form. There has been a debate among researchers about the benefits of using morphological tools in TC. Studies in the English language illustrated that performing stemming during the preproc...
Rational Kernels for Arabic Stemming and Text Classification
In this paper, we address the problems of Arabic Text Classification and stemming using Transducers and Rational Kernels. We introduce a new stemming technique based on the use of Arabic patterns (Pattern Based Stemmer). Patterns are modelled using transducers and stemming is done without depending on any dictionary. Using transducers for stemming, documents are transformed into finite state tr...
Journal title: IJIRR
Volume: 1
Publication date: 2011